Make NVTE tensor handle pool size configurable #3090
Signed-off-by: hongbinl <hongbinl@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary
This PR exposes two new runtime environment variables (
Confidence Score: 4/5
Safe to merge; the change is backward-compatible (the default pool size is unchanged) and the new code paths are straightforward.
The core logic is correct: validation guards against zero and overflow, the member-initializer ordering is right, and the singleton construction is safe. Two minor defects exist:
- Partial string inputs such as "50bad" are silently accepted as 50, because the shared getenv helper does not check for end of input after parsing.
- Negative env var values wrap to SIZE_MAX and trigger the "too large" error rather than the intended "must be a positive integer" message.
Both are input-validation edge cases with no correctness impact under documented usage.
Worth a closer look: transformer_engine/common/transformer_engine.cpp, specifically the env-var parsing and validation in GetTensorHandlePoolSizeMB.
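The two edge cases noted in the review can be illustrated with a stricter parser. This is only a minimal sketch, assuming C++17: `parse_strict_size` is a hypothetical name, not the shared getenv helper in Transformer Engine, whose implementation is not shown on this page. It relies on `std::from_chars`, which does not accept a leading minus sign for unsigned types and reports where parsing stopped, so both "50bad" and "-1" can be rejected explicitly.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper for illustration; not part of Transformer Engine.
// Returns the parsed value only if the whole string is an unsigned decimal integer.
std::optional<std::size_t> parse_strict_size(std::string_view text) {
  std::size_t value = 0;
  const char* first = text.data();
  const char* last = first + text.size();
  const auto result = std::from_chars(first, last, value);
  // Rejects empty input, non-numeric input, a leading '-', and out-of-range values.
  if (result.ec != std::errc{}) return std::nullopt;
  // Rejects trailing characters such as the "bad" in "50bad".
  if (result.ptr != last) return std::nullopt;
  return value;
}
```

With a parser of this shape, a caller could report "must be a positive integer" for both a missing parse result and a parsed zero, instead of letting a negative value wrap around first.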
Sequence Diagram
sequenceDiagram
participant Env as Environment Variable
participant Helper as GetTensorHandlePoolSizeMB
participant Cap as GetTensorHandlePoolCapacity
participant Alloc as TensorAllocator (singleton)
participant App as Application Code
App->>Alloc: first call to instance()
activate Alloc
Alloc->>Env: std::getenv("NVTE_TENSOR_HANDLE_POOL_SIZE_MB")
Env-->>Helper: raw string value (or null default 20)
Helper->>Helper: parse via getenv size_t
Helper->>Helper: "NVTE_CHECK pool_size_mb > 0"
Helper->>Helper: NVTE_CHECK pool_size_mb within bounds
Helper-->>Cap: pool_size_mb
Cap->>Cap: compute pool_size_bytes
Cap->>Cap: "NVTE_CHECK pool_size_bytes >= sizeof(Tensor)"
Cap-->>Alloc: MAX_TENSOR_NUM
Alloc->>Alloc: memory.reserve(MAX_TENSOR_NUM)
Alloc-->>App: singleton ready
deactivate Alloc
App->>Alloc: Allocate(mode, out, N)
Alloc->>Alloc: "check available >= N"
alt pool exhausted
Alloc-->>App: NVTE_CHECK error with pool_size_mb and env var hint
else space available
Alloc-->>App: NVTETensor handles
end
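The validation and capacity steps in the diagram can be sketched roughly as follows. This is not the PR's code: `pool_capacity` and `kBytesPerMB` are hypothetical names, `handle_size` stands in for `sizeof(Tensor)`, exceptions stand in for `NVTE_CHECK`, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Assumption: 1 MB is treated as 1024 * 1024 bytes.
constexpr std::size_t kBytesPerMB = 1024 * 1024;

// Hypothetical stand-in for the size-to-capacity computation in the diagram.
// handle_size plays the role of sizeof(Tensor).
std::size_t pool_capacity(std::size_t pool_size_mb, std::size_t handle_size) {
  if (handle_size == 0) {
    throw std::invalid_argument("handle size must be nonzero");
  }
  if (pool_size_mb == 0) {
    throw std::invalid_argument("pool size must be a positive integer");
  }
  // Guard the multiplication below against size_t overflow.
  if (pool_size_mb > std::numeric_limits<std::size_t>::max() / kBytesPerMB) {
    throw std::invalid_argument("pool size is too large");
  }
  const std::size_t pool_size_bytes = pool_size_mb * kBytesPerMB;
  if (pool_size_bytes < handle_size) {
    throw std::invalid_argument("pool is smaller than one handle");
  }
  // Number of handles that fit in the pool (MAX_TENSOR_NUM in the diagram).
  return pool_size_bytes / handle_size;
}
```

Under these assumptions, the default of 20 MB with a hypothetical 1024-byte handle gives a capacity of 20480 handles; the real figure depends on the actual `sizeof(Tensor)`.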
I am not opposed to creating such a variable, but I would really like to see an example of a legitimate use that goes over this limit. Could you run the experiment that is failing for you with https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/common/transformer_engine.cpp#L487 set to true and send me the resulting log?
Summary
Motivation
Large model initialization paths can legitimately create more TE tensor handles than the current fixed-size pool allows, even when GPU and CPU memory are otherwise sufficient. Exposing the pool size as an environment variable avoids downstream source patches for these scale-dependent cases.
Testing